## Introduction
Accurately projecting each player's total points is a central task in basketball analytics. This number helps quantify a player's offensive ability and overall value to their team, and coaches, analysts, and fans use it to evaluate scoring ability, informing in-game decisions, player recruitment, and scouting.
This notebook investigates the use of several machine-learning approaches for forecasting basketball points. We concentrate on three regression models: the K-Nearest Neighbors (KNN) Regressor, the Decision Tree (DT) Regressor, and the Random Forest Regressor (RFR). Each predicts a player's total points from performance indicators such as minutes played, field goals made, and free throws.
Our goal is to evaluate and examine the efficacy of various models in forecasting basketball scores. This comparative analysis will help us understand the advantages and disadvantages of each technique, directing us to the most successful model for this specific dataset.
Join us as we delve into the fascinating world of basketball data analytics, assessing and comparing the predictions from each regression model to determine their effectiveness in projecting players’ scoring contributions.
This notebook contains the following tasks:
• Dataset overview: learn about the basketball dataset's structure and properties.
• Import libraries: load the libraries required for data manipulation and visualization.
• Read the dataset and extract information: load the dataset and collect preliminary insights.
• Data visualization: use visualization to better understand the distribution of, and relationships in, the data.
• Features: choose the features that will help predict basketball points.
## Model building:
• K-Neighbors Regressor: use the KNN Regressor to predict points.
• Decision Tree Regressor: Use the DT Regressor to predict points.
• Random Forest Regressor: Use RFR to predict points.
• Predictions visualization: Visualize and assess the predictions provided by each regression model.
# Load libraries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.2.3
library(plotly)
## Warning: package 'plotly' was built under R version 4.2.3
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(caret)
## Warning: package 'caret' was built under R version 4.2.3
## Loading required package: lattice
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.2.3
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
library(kknn)
## Warning: package 'kknn' was built under R version 4.2.3
##
## Attaching package: 'kknn'
## The following object is masked from 'package:caret':
##
## contr.dummy
library(rpart)
library(tree)
## Warning: package 'tree' was built under R version 4.2.3
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.2.3
# Suppress warnings
options(warn=-1)
The data frame df holds basketball performance statistics for each player. Its original column names are terse and not self-explanatory, so the code below renames them to be more meaningful and intelligible.
# Load the dataset
df <- read.csv('2023_nba_player_stats.csv')
# View the first few rows of the dataset
head(df, 3)
## PName POS Team Age GP W L Min PTS FGM FGA FG. X3PM X3PA X3P.
## 1 Jayson Tatum SF BOS 25 74 52 22 2732.2 2225 727 1559 46.6 240 686 35.0
## 2 Joel Embiid C PHI 29 66 43 23 2284.1 2183 728 1328 54.8 66 200 33.0
## 3 Luka Doncic PG DAL 24 66 33 33 2390.5 2138 719 1449 49.6 185 541 34.2
## FTM FTA FT. OREB DREB REB AST TOV STL BLK PF FP DD2 TD3 X...
## 1 531 622 85.4 78 571 649 342 213 78 51 160 3691 31 1 470
## 2 661 771 85.7 113 557 670 274 226 66 112 205 3706 39 1 424
## 3 515 694 74.2 54 515 569 529 236 90 33 166 3747 36 10 128
# Check the dimensions of the dataset
dim(df)
## [1] 539 30
# Check for duplicate rows
sum(duplicated(df))
## [1] 0
# Rename columns
names(df) <- c('Player_Name', 'Position', 'Team_Abbreviation', 'Age', 'Games_Played', 'Wins', 'Losses', 'Minutes_Played', 'Total_Points', 'Field_Goals_Made', 'Field_Goals_Attempted', 'Field_Goal_Percentage', 'Three_Point_FG_Made', 'Three_Point_FG_Attempted', 'Three_Point_FG_Percentage', 'Free_Throws_Made', 'Free_Throws_Attempted', 'Free_Throw_Percentage', 'Offensive_Rebounds', 'Defensive_Rebounds', 'Total_Rebounds', 'Assists', 'Turnovers', 'Steals', 'Blocks', 'Personal_Fouls', 'NBA_Fantasy_Points', 'Double_Doubles', 'Triple_Doubles', 'Plus_Minus')
# Display structure of the dataset
str(df)
## 'data.frame': 539 obs. of 30 variables:
## $ Player_Name : chr "Jayson Tatum" "Joel Embiid" "Luka Doncic" "Shai Gilgeous-Alexander" ...
## $ Position : chr "SF" "C" "PG" "PG" ...
## $ Team_Abbreviation : chr "BOS" "PHI" "DAL" "OKC" ...
## $ Age : int 25 29 24 24 28 21 28 26 24 28 ...
## $ Games_Played : int 74 66 66 68 63 79 77 68 73 77 ...
## $ Wins : int 52 43 33 33 47 40 44 44 38 38 ...
## $ Losses : int 22 23 33 35 16 39 33 24 35 39 ...
## $ Minutes_Played : num 2732 2284 2390 2416 2024 ...
## $ Total_Points : int 2225 2183 2138 2135 1959 1946 1936 1922 1914 1913 ...
## $ Field_Goals_Made : int 727 728 719 704 707 707 658 679 597 673 ...
## $ Field_Goals_Attempted : int 1559 1328 1449 1381 1278 1541 1432 1402 1390 1388 ...
## $ Field_Goal_Percentage : num 46.6 54.8 49.6 51 55.3 45.9 45.9 48.4 42.9 48.5 ...
## $ Three_Point_FG_Made : int 240 66 185 58 47 213 218 245 154 204 ...
## $ Three_Point_FG_Attempted : int 686 200 541 168 171 578 636 635 460 544 ...
## $ Three_Point_FG_Percentage: num 35 33 34.2 34.5 27.5 36.9 34.3 38.6 33.5 37.5 ...
## $ Free_Throws_Made : int 531 661 515 669 498 319 402 319 566 363 ...
## $ Free_Throws_Attempted : int 622 771 694 739 772 422 531 368 639 428 ...
## $ Free_Throw_Percentage : num 85.4 85.7 74.2 90.5 64.5 75.6 75.7 86.7 88.6 84.8 ...
## $ Offensive_Rebounds : int 78 113 54 59 137 47 141 63 56 42 ...
## $ Defensive_Rebounds : int 571 557 515 270 605 411 626 226 161 303 ...
## $ Total_Rebounds : int 649 670 569 329 742 458 767 289 217 345 ...
## $ Assists : int 342 274 529 371 359 350 316 301 741 327 ...
## $ Turnovers : int 213 226 236 192 246 259 216 180 300 194 ...
## $ Steals : int 78 66 90 112 52 125 49 99 80 69 ...
## $ Blocks : int 51 112 33 65 51 58 21 27 9 18 ...
## $ Personal_Fouls : int 160 205 166 192 197 186 233 168 104 159 ...
## $ NBA_Fantasy_Points : int 3691 3706 3747 3425 3451 3311 3324 2918 3253 2885 ...
## $ Double_Doubles : int 31 39 36 3 46 9 40 5 40 2 ...
## $ Triple_Doubles : int 1 1 10 0 6 0 0 0 0 0 ...
## $ Plus_Minus : int 470 424 128 149 341 97 170 338 100 18 ...
# Descriptive statistics for numeric variables
summary(df[sapply(df, is.numeric)])
## Age Games_Played Wins Losses
## Min. :19.00 Min. : 1.00 Min. : 0.00 Min. : 0.00
## 1st Qu.:23.00 1st Qu.:30.50 1st Qu.:12.00 1st Qu.:14.00
## Median :25.00 Median :54.00 Median :25.00 Median :25.00
## Mean :25.97 Mean :48.04 Mean :24.02 Mean :24.02
## 3rd Qu.:29.00 3rd Qu.:68.00 3rd Qu.:36.00 3rd Qu.:34.00
## Max. :42.00 Max. :83.00 Max. :57.00 Max. :60.00
## Minutes_Played Total_Points Field_Goals_Made Field_Goals_Attempted
## Min. : 1.0 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 329.0 1st Qu.: 120.5 1st Qu.: 45.5 1st Qu.: 93.5
## Median : 970.2 Median : 374.0 Median :138.0 Median : 300.0
## Mean :1103.6 Mean : 523.4 Mean :191.6 Mean : 403.0
## 3rd Qu.:1845.9 3rd Qu.: 769.5 3rd Qu.:283.5 3rd Qu.: 598.5
## Max. :2963.2 Max. :2225.0 Max. :728.0 Max. :1559.0
## Field_Goal_Percentage Three_Point_FG_Made Three_Point_FG_Attempted
## Min. : 0.00 Min. : 0.00 Min. : 0.0
## 1st Qu.: 41.65 1st Qu.: 5.00 1st Qu.: 17.0
## Median : 45.50 Median : 36.00 Median :109.0
## Mean : 46.33 Mean : 56.32 Mean :156.1
## 3rd Qu.: 50.60 3rd Qu.: 92.00 3rd Qu.:249.5
## Max. :100.00 Max. :301.00 Max. :731.0
## Three_Point_FG_Percentage Free_Throws_Made Free_Throws_Attempted
## Min. : 0.00 Min. : 0.00 Min. : 0.0
## 1st Qu.: 28.10 1st Qu.: 13.50 1st Qu.: 18.0
## Median : 34.20 Median : 42.00 Median : 60.0
## Mean : 31.53 Mean : 83.95 Mean :107.4
## 3rd Qu.: 38.50 3rd Qu.:113.50 3rd Qu.:147.0
## Max. :100.00 Max. :669.00 Max. :772.0
## Free_Throw_Percentage Offensive_Rebounds Defensive_Rebounds Total_Rebounds
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 66.70 1st Qu.: 10.00 1st Qu.: 36.5 1st Qu.: 50.5
## Median : 76.30 Median : 33.00 Median :118.0 Median :159.0
## Mean : 71.99 Mean : 47.62 Mean :150.6 Mean :198.3
## 3rd Qu.: 84.10 3rd Qu.: 63.00 3rd Qu.:229.5 3rd Qu.:286.0
## Max. :100.00 Max. :274.00 Max. :744.0 Max. :973.0
## Assists Turnovers Steals Blocks
## Min. : 0.0 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 22.0 1st Qu.: 14.5 1st Qu.: 8.50 1st Qu.: 5.00
## Median : 69.0 Median : 44.0 Median : 28.00 Median : 13.00
## Mean :115.5 Mean : 61.3 Mean : 33.27 Mean : 21.24
## 3rd Qu.:162.5 3rd Qu.: 92.5 3rd Qu.: 51.00 3rd Qu.: 28.00
## Max. :741.0 Max. :300.0 Max. :128.00 Max. :193.00
## Personal_Fouls NBA_Fantasy_Points Double_Doubles Triple_Doubles
## Min. : 0.00 Min. : -1 Min. : 0.000 Min. : 0.0000
## 1st Qu.: 32.00 1st Qu.: 254 1st Qu.: 0.000 1st Qu.: 0.0000
## Median : 86.00 Median : 810 Median : 0.000 Median : 0.0000
## Mean : 91.18 Mean :1037 Mean : 4.011 Mean : 0.2208
## 3rd Qu.:140.00 3rd Qu.:1646 3rd Qu.: 3.000 3rd Qu.: 0.0000
## Max. :279.00 Max. :3842 Max. :65.000 Max. :29.0000
## Plus_Minus
## Min. :-642
## 1st Qu.: -70
## Median : -7
## Mean : 0
## 3rd Qu.: 57
## Max. : 640
# Descriptive statistics for categorical variables
summary(df[sapply(df, is.character)])
## Player_Name Position Team_Abbreviation
## Length:539 Length:539 Length:539
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
Histogram of player positions: this histogram shows the distribution of players across the different positions.
Bar chart of average points per position: this chart visualizes the average total points scored by players at each position. Players at the PG position have the highest average total points, followed by SG, SF, PF, and the others.
# Checking for missing values
colSums(is.na(df))
## Player_Name Position Team_Abbreviation
## 0 0 0
## Age Games_Played Wins
## 0 0 0
## Losses Minutes_Played Total_Points
## 0 0 0
## Field_Goals_Made Field_Goals_Attempted Field_Goal_Percentage
## 0 0 0
## Three_Point_FG_Made Three_Point_FG_Attempted Three_Point_FG_Percentage
## 0 0 0
## Free_Throws_Made Free_Throws_Attempted Free_Throw_Percentage
## 0 0 0
## Offensive_Rebounds Defensive_Rebounds Total_Rebounds
## 0 0 0
## Assists Turnovers Steals
## 0 0 0
## Blocks Personal_Fouls NBA_Fantasy_Points
## 0 0 0
## Double_Doubles Triple_Doubles Plus_Minus
## 0 0 0
# Handling missing values (e.g., filling NA in 'Position' with 'SG')
df$Position[is.na(df$Position)] <- 'SG'
# Histogram of 'Position' using ggplot2
library(ggplot2)
ggplot(df, aes(x = Position)) +
geom_bar(fill = "blue") +
theme_minimal() +
labs(title = 'Players Position Value Counts', x = 'Position', y = 'Count')
# Alternatively, using plotly
library(plotly)
fig <- plot_ly(df, x = ~Position, type = "histogram")
fig
# Average points per position using ggplot2
position_stats <- df %>%
group_by(Position) %>%
summarize(Average_Total_Points = mean(Total_Points, na.rm = TRUE))
ggplot(position_stats, aes(x = Position, y = Average_Total_Points, fill = Position)) +
geom_bar(stat = "identity") +
theme_minimal() +
labs(title = 'Average Points per Position', x = 'Position', y = 'Average Total Points')
# Alternatively, using plotly
fig <- plot_ly(position_stats, x = ~Position, y = ~Average_Total_Points, type = "bar")
fig
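For readers without dplyr, the same per-position average can be computed in base R with `aggregate()`. A minimal sketch on a toy data frame with hypothetical values (the column names mirror the renamed dataset):

```r
# Toy data frame standing in for the real dataset (hypothetical values)
toy <- data.frame(
  Position     = c("PG", "PG", "SG", "SG", "C"),
  Total_Points = c(1800, 1600, 1500, 1300, 900)
)

# aggregate() groups by Position and applies mean() to Total_Points
position_means <- aggregate(Total_Points ~ Position, data = toy, FUN = mean)
position_means
```

`aggregate()` returns one row per position, so the result can be fed straight into the same ggplot2 bar chart used above.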
Age vs. field goal percentage: this graph depicts the association between player age and field goal percentage, which is useful for checking whether there is a link between a player's age and shooting effectiveness. The most field goals were recorded for ages 23 to 25 and the fewest from age 31 onward.
Age vs. assists: this graph shows how player age correlates with the number of assists. Although the scatter plot shows little variation in assists across age groups, it does reveal which positions tend to assist more, and the PG position stands out with a high number of assists.
Bar chart of average fantasy points by position: this bar chart shows the average fantasy points scored by players at each position, allowing a comparison across positions. Players at the PG position have the highest average fantasy points and players at the G position the lowest.
# Histogram of Player Ages
ggplot(df, aes(x = Age)) +
geom_histogram(binwidth = 1, fill = "Blue") +
labs(title = "Distribution of Player Ages", x = "Age", y = "Count")
# Scatter Plots
# Age vs. Total Points
ggplot(df, aes(x = Age, y = Total_Points, color = Position)) +
geom_point() +
labs(title = "Player Age vs Total Points", x = "Age", y = "Total Points")
# Age vs. Field Goal Percentage
ggplot(df, aes(x = Age, y = Field_Goal_Percentage, color = Position)) +
geom_point() +
labs(title = "Player Age vs Field Goal Percentage", x = "Age", y = "Field Goal Percentage")
# Age vs. Assists
ggplot(df, aes(x = Age, y = Assists, color = Position)) +
geom_point() +
labs(title = "Player Age vs Assists", x = "Age", y = "Assists")
# Bar Charts
# Average Fantasy Points by Position
avg_fantasy_points <- df %>%
group_by(Position) %>%
summarize(Avg_Fantasy_Points = mean(NBA_Fantasy_Points, na.rm = TRUE))
ggplot(avg_fantasy_points, aes(x = Position, y = Avg_Fantasy_Points, fill = Position)) +
geom_bar(stat = "identity") +
labs(title = "Average Fantasy Points by Position", x = "Position", y = "Average Fantasy Points")
# Double and Triple Doubles by Position
double_doubles_by_position <- df %>%
group_by(Position) %>%
summarize(Double_Doubles = sum(Double_Doubles, na.rm = TRUE))
triple_doubles_by_position <- df %>%
group_by(Position) %>%
summarize(Triple_Doubles = sum(Triple_Doubles, na.rm = TRUE))
## Box Plots
These box plots can identify patterns and abnormalities in a variety of player performance measures. For example, a box plot for ‘Total Points’ may show the distribution of points scored by players throughout different games or seasons, highlighting the average range of scores as well as any extraordinary performances.
This visualization method is very useful in exploratory data analysis since it provides insights into the nature of the data that can inform subsequent studies or model-building.
These box plots are used to visualize and compare the distributions of various basketball-related statistics, assisting in understanding the data structure, identifying any data quality issues, and gaining insights into player performances.
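Base R's `boxplot.stats()` exposes the numbers behind these plots: the five-number summary and any points flagged as outliers. A small sketch with hypothetical point totals and one injected extreme value:

```r
# 20 ordinary values plus one extreme scorer (hypothetical data)
pts <- c(rep(c(100, 200, 300, 400), 5), 5000)

bs <- boxplot.stats(pts)
bs$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out    # values beyond 1.5 * IQR from the hinges -> 5000
```

This is the same outlier rule ggplot2's `geom_boxplot()` applies when it draws the individual points beyond the whiskers.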
library(ggplot2)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:randomForest':
##
## combine
## The following object is masked from 'package:dplyr':
##
## combine
# Prepare data (excluding certain columns)
column_to_exclude <- c('Player_Name', 'Position', 'Team_Abbreviation')
columns <- setdiff(names(df), column_to_exclude)
# Plot a box plot for each numeric column, one at a time
for (i in seq_along(columns)) {
p <- ggplot(df, aes(y = .data[[columns[i]]])) +
geom_boxplot() +
theme_minimal() +
ggtitle(paste("Box Plot of", columns[i]))
print(p) # Display each plot individually
}
In this section the data are preprocessed by splitting them into training and testing sets. The player-name column is dropped first, since it is an identifier rather than a predictor. caret's createDataPartition() then splits the remaining data on the target Total_Points: 70% of the rows go to the training set and the remaining 30% to the testing set. set.seed(5555) makes the split reproducible.
# Assuming df is your dataframe and 'Total_Points' is the target variable
set.seed(5555) # for reproducibility
# Remove the player-name identifier column
updatedDf <- df[, !grepl("^Player_Name", names(df))]
trainIndex <- createDataPartition(updatedDf$Total_Points, p = .7, list = FALSE)
dataTrain <- updatedDf[ trainIndex,]
dataTest <- updatedDf[-trainIndex,]
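A minimal base-R version of the same idea, for readers without caret (note that `createDataPartition` additionally stratifies on the target, which plain `sample()` does not):

```r
set.seed(5555)              # for reproducibility
n <- 539                    # number of rows in this dataset
train_idx <- sample(n, size = floor(0.7 * n))  # random 70% for training

length(train_idx)           # -> 377 training rows
n - length(train_idx)       # -> 162 testing rows
```

Indexing the data frame with `train_idx` and `-train_idx` then yields the two subsets, just as with the caret partition above.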
## Random Forest
A Random Forest regressor is fitted to the training data to predict Total_Points. randomForest() grows an ensemble of decision trees on bootstrap samples of the training set and averages their predictions, which typically yields more robust results than any single tree. Here a small forest of 5 trees (ntree = 5) is grown, predictions are made on the held-out test set, and the model is scored with mean squared error (MSE), R-squared, and mean absolute error (MAE).
library(randomForest)
library(caret)
library(Metrics)
##
## Attaching package: 'Metrics'
## The following objects are masked from 'package:caret':
##
## precision, recall
# Train the Random Forest model (ntree = 5 is small; larger values such as 500 give more stable results)
rf_model <- randomForest(Total_Points ~ ., data = dataTrain, ntree = 5)
# Make predictions
rf_predictions <- predict(rf_model, dataTest)
# Evaluate the model
rf_mse <- mse(dataTest$Total_Points, rf_predictions)
rf_r2 <- R2(dataTest$Total_Points, rf_predictions)
rf_mae <- mae(dataTest$Total_Points, rf_predictions)
print(paste("MSE:", rf_mse))
## [1] "MSE: 2853.47921756944"
print(paste("R2 score:", rf_r2))
## [1] "R2 score: 0.987580929929807"
print(paste("mae score:", rf_mae))
## [1] "mae score: 32.8888125"
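The three metrics reported above have simple closed forms. A base-R sketch on hypothetical actual/predicted vectors (note that `caret::R2` defaults to the squared correlation, which on real data only approximately matches the sum-of-squares R-squared shown here):

```r
actual    <- c(100, 200, 300, 400)   # hypothetical point totals
predicted <- c(110, 190, 310, 390)   # hypothetical model output

mse_val <- mean((actual - predicted)^2)        # mean squared error  -> 100
mae_val <- mean(abs(actual - predicted))       # mean absolute error -> 10
r2_val  <- 1 - sum((actual - predicted)^2) /
               sum((actual - mean(actual))^2)  # sum-of-squares R^2  -> 0.992
```

Writing the metrics out this way makes clear why MSE penalizes large misses more heavily than MAE does.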
A K-Nearest Neighbors (KNN) regressor is trained and assessed next. kknn() predicts each test player's Total_Points from the target values of the k most similar training players; here k = 1, so each prediction is simply the point total of the single nearest training neighbor. By default kknn scales the features before computing distances.
The fitted values are then compared against the observed test-set points using the same three metrics as before: MSE, R-squared, and MAE.
# Library for the KNN model
library(kknn)
# Inspect the first few rows of the test set
head(dataTest)
## Position Team_Abbreviation Age Games_Played Wins Losses Minutes_Played
## 1 SF BOS 25 74 52 22 2732.2
## 9 PG ATL 24 73 38 35 2540.7
## 10 SG CHI 28 77 38 39 2767.9
## 16 PF UTA 25 66 32 34 2272.5
## 18 SG HOU 21 76 20 56 2602.2
## 30 SG GSW 33 69 38 31 2278.9
## Total_Points Field_Goals_Made Field_Goals_Attempted Field_Goal_Percentage
## 1 2225 727 1559 46.6
## 9 1914 597 1390 42.9
## 10 1913 673 1388 48.5
## 16 1691 571 1144 49.9
## 18 1683 566 1359 41.6
## 30 1509 546 1252 43.6
## Three_Point_FG_Made Three_Point_FG_Attempted Three_Point_FG_Percentage
## 1 240 686 35.0
## 9 154 460 33.5
## 10 204 544 37.5
## 16 200 510 39.2
## 18 187 554 33.8
## 30 301 731 41.2
## Free_Throws_Made Free_Throws_Attempted Free_Throw_Percentage
## 1 531 622 85.4
## 9 566 639 88.6
## 10 363 428 84.8
## 16 349 399 87.5
## 18 364 463 78.6
## 30 116 132 87.9
## Offensive_Rebounds Defensive_Rebounds Total_Rebounds Assists Turnovers
## 1 78 571 649 342 213
## 9 56 161 217 741 300
## 10 42 303 345 327 194
## 16 130 440 570 123 127
## 18 43 241 284 281 200
## 30 39 247 286 163 123
## Steals Blocks Personal_Fouls NBA_Fantasy_Points Double_Doubles
## 1 78 51 160 3691 31
## 9 80 9 104 3253 40
## 10 69 18 159 2885 2
## 16 42 38 137 2673 28
## 18 59 18 131 2476 0
## 30 49 29 130 2208 2
## Triple_Doubles Plus_Minus
## 1 1 470
## 9 0 100
## 10 0 18
## 16 0 163
## 18 0 -447
## 30 0 163
# Fit the KNN model (kknn scales the features internally by default)
knn_model <- kknn(Total_Points ~ ., train = dataTrain, test= dataTest, k = 1)
# Make predictions
knn_predictions <- fitted(knn_model)
# Evaluating the model
knn_mse <- mse(dataTest$Total_Points, knn_predictions)
knn_r2 <- R2(dataTest$Total_Points, knn_predictions)
knn_mae <- mae(dataTest$Total_Points, knn_predictions)
print(paste("KNN - MSE:", knn_mse))
## [1] "KNN - MSE: 17012.7125"
print(paste("KNN - R2 score:", knn_r2))
## [1] "KNN - R2 score: 0.927435691388151"
print(paste("KNN - MAE score:", knn_mae))
## [1] "KNN - MAE score: 94.3625"
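With k = 1 the prediction is just the target value of the closest training row. A base-R sketch on a single hypothetical feature (minutes played) illustrates the mechanics that kknn performs across all scaled features:

```r
train_minutes <- c(500, 1200, 2000, 2700)   # hypothetical training feature
train_points  <- c(150, 480, 1100, 2100)    # corresponding point totals

predict_1nn <- function(x) {
  nearest <- which.min(abs(train_minutes - x))  # index of the closest neighbor
  train_points[nearest]
}

predict_1nn(1900)  # closest minutes value is 2000 -> 1100
```

Because a 1-NN prediction copies a single training value, it is highly sensitive to noise and outliers, which is consistent with the higher MSE and MAE seen above.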
## Decision Tree Regressor
A Decision Tree regressor is fitted with rpart using method = "anova", which is rpart's regression mode: each split is chosen to maximize the reduction in the summed squared error of the two resulting child nodes. Predictions on the test set are again scored with MSE, R-squared, and MAE for comparison with the other models.
library(rpart)
# Train the Decision Tree model
dt_model <- rpart(Total_Points ~ ., data = dataTrain, method = "anova")
# Make predictions
dt_predictions <- predict(dt_model, dataTest, type = "vector")
# Evaluate the model
dt_mse <- mse(dataTest$Total_Points, dt_predictions)
dt_r2 <- R2(dataTest$Total_Points, dt_predictions)
dt_mae <- mae(dataTest$Total_Points, dt_predictions)
print(paste("Decision Tree - MSE:", dt_mse))
## [1] "Decision Tree - MSE: 8689.63837496918"
print(paste("Decision Tree - R2 score:", dt_r2))
## [1] "Decision Tree - R2 score: 0.962413315013407"
print(paste("Decision Tree - MAE score:", dt_mae))
## [1] "Decision Tree - MAE score: 70.0075546120961"
## Model Evaluation
Visual comparisons between predicted and actual points are generated in this section using several types of plots.
Scatter plot: a scatter plot compares the actual points (x-axis) with the predicted points (y-axis), with each point color-coded by its actual value.
Residual plot: a residual plot shows the disparities between the actual and predicted points. The residuals are computed as the actual values minus the predictions, and a dashed line at y = 0 helps visualize the deviation from a perfect fit.
Predicted vs. true line plot: this plot compares the true values (x-axis) with the predicted values (y-axis), overlaying the scatter of predictions with an identity line and a fitted regression line that summarizes the linear relationship between the true and predicted values.
All three plots are built with ggplot2 and converted to interactive figures with ggplotly(). Together, the scatter plot of actual vs. predicted, the residual plot, and the predicted-vs-true line plot are standard tools in regression analysis for evaluating predictive model performance; each provides a different perspective on the accuracy and character of the model's predictions.
library(randomForest)
library(ggplot2)
library(plotly)
# Make predictions using the Random Forest model
rf_predictions <- predict(rf_model, dataTest)
# Create a dataframe for comparison
comparison_df <- data.frame(Actual = dataTest$Total_Points, Predicted = rf_predictions)
# Calculate residuals
comparison_df$Residuals <- comparison_df$Actual - comparison_df$Predicted
fig_scatter <- ggplot(comparison_df, aes(x = Actual, y = Predicted, color = Actual)) +
geom_point() +
ggtitle("Comparison of Actual vs. Predicted") +
labs(x = "Actual Points", y = "Predicted Points") +
theme_minimal()
ggplotly(fig_scatter) # Convert to interactive plotly plot
fig_residual <- ggplot(comparison_df, aes(x = Predicted, y = Residuals)) +
geom_point(color = "orangered") +
geom_hline(yintercept = 0, linetype = "dashed", color = "orange") +
ggtitle("Residual Plot") +
labs(x = "Predicted Values", y = "Residuals") +
theme_minimal()
ggplotly(fig_residual) # Convert to interactive plotly plot
fig_line <- ggplot(comparison_df, aes(x = Actual, y = Predicted)) +
geom_point(color = "Green") +
geom_line(aes(y = Actual), color = "#98DFD6") +
geom_smooth(method = lm, color = "#FFDD83", se = FALSE) +
ggtitle("Predicted vs. True Line Plot") +
labs(x = "True Values", y = "Predicted Values") +
theme_minimal()
ggplotly(fig_line) # Convert to interactive plotly plot
## `geom_smooth()` using formula = 'y ~ x'
## Conclusion
Based on the development and evaluation of the Decision Tree, K-Nearest Neighbors (KNN), and Random Forest regression models for predicting basketball points in this NBA dataset, the following conclusions can be drawn:
Model Performance and Accuracy:
The Random Forest Regressor was the most accurate of the three models (Decision Tree, KNN, and Random Forest). This suggests that Random Forest's ensemble technique, which combines many decision trees to produce more robust and generalized predictions, is well suited to this type of data.
While the Decision Tree model is simpler and easier to interpret, it did not capture the complexity in the data as well as the Random Forest. Overfitting is a common problem for single decision trees, especially when dealing with complicated and diverse datasets like those used in sports analytics.
The KNN model, which makes predictions based on distance measurements, performed the worst here. This may be due to the high dimensionality of the data and the need for careful feature scaling and parameter selection (the choice of k; with k = 1 each prediction simply copies the nearest neighbor). KNN models are highly sensitive to the scale of the data, and imbalances or outliers can have a major impact on their performance.
Insights from Model Evaluation:
The evaluation metrics used to analyze the models (MSE, R-squared, and MAE) provided a full picture of their predictive capabilities. Lower MSE and MAE values, together with a higher R-squared, demonstrate the Random Forest model's superior ability to predict the total points scored by NBA players. Such measurements are essential not only for determining model accuracy, but also for understanding the kinds of errors each model makes; this knowledge can guide future model development and optimization.